blitzc2p.readme
1998-08-10
This is a collection of fast chunky-to-planar (c2p) routines implemented in Blitz
Basic for use in any software, including commercial and shareware.
There are five standard c2p's, each in two versions: one does a normal c2p
operation and the other does the c2p and a clearscreen at the same time
(25-30% faster than a separate clearscreen). There is a sixth c2p that is of a
different design and has special requirements.
c2p030only   : Only for use on 68030 cpu's. For 68030 users, this c2p
               will perform better than all the others.
c2p030onlyCLS: As above, except that it also clears (to a given longword)
               the chunky buffer that it has just read data from.
c2p040only   : Only for use on 68040 cpu's, but performs very well on
               anything higher. For 68040 users, this c2p will perform
               better than all the others.
c2p040onlyCLS: As above, except that it also clears (to a given longword)
               the chunky buffer that it has just read data from.
c2p060only   : Only for use on 68060 cpu's. It is not, however, the fastest
               and you will find that c2p040only and c2pCACHE are faster.
               Probably does not perform very well on anything lower than
               an 060.
c2p060onlyCLS: As above, except that it also clears (to a given longword)
               the chunky buffer that it has just read data from.
c2pGeneric   : A generic c2p for use on all cpu's if it is not possible
               to isolate the cpu model or to use separate c2p's for
               different cpu's. Performs well on 030 but is somewhat slower
               than dedicated routines on higher processors. Mainly to provide
               support for 030, as higher processors will be crippled.
c2pGenericCLS: As above, except that it also clears (to a given longword)
               the chunky buffer that it has just read data from.
c2p040plus   : A kind of generic routine for 040 or higher. Performs generally
               quite well on all 040 and 060 cpu's but is not as fast as
               dedicated c2p's. Not suitable for 030's.
c2p040plusCLS: As above, except that it also clears (to a given longword)
               the chunky buffer that it has just read data from.
c2pCACHE     : Designed for use on anything from a 68040 upwards. Second
               fastest on 040/25 but joint fastest on 060/50. Perhaps more
               geared towards 68060 than anything lower. This c2p is less
               flexible and requires special treatment.
c2pCACHECLS  : This does not exist. Any additional writing to memory, such
               as when a clearscreen is performed, would meddle with the
               way that the routine specially handles datacaches.
Some general performance times for the routines are as follows. These times
are inclusive of having a screen open and being displayed, of the specified
dimensions (* indicates best suitability for the given c2p):
c2p030only   : On 68030/50MHz PAL
68030 only     * 320x200 @40.4fps
               * 320x256 @30.7fps
               On 68040/25MHz DoublePAL
                 320x200 @42fps
                 320x256 @31fps
               On 68040/25MHz PAL
                 320x200 @44fps
                 320x256 @36.5fps
c2p030onlyCLS: On 68040/25MHz PAL
68030 only       320x200 @38.5fps
                 320x256 @29.7fps
c2p040only   : On 68030/50MHz PAL
68040-68060      320x200 @28.2fps
                 320x256 @21.6fps
               On 68040/25MHz DoublePAL
               * 320x200 @49.6fps
               * 320x256 @36.2fps
               On 68040/25MHz PAL
               * 320x200 @55.3fps
               * 320x256 @42.5fps
               On 68060/50MHz PAL
               * 320x200 @66.1fps
               * 320x256 @50fps
c2p040onlyCLS: On 68040/25MHz PAL
68040-68060    * 320x200 @49fps (separate clearscreen ran about 45-46fps)
               * 320x256 @37.1fps
c2p060only   : On 68030/50MHz PAL
68060 only       320x200 @27.9fps
                 320x256 @21.5fps
               On 68040/25MHz DoublePAL
                 320x200 @46.0fps
                 320x256 @34.2fps
               On 68040/25MHz PAL
                 320x200 @48.2fps
                 320x256 @37.4fps
               On 68060/50MHz PAL
               * 320x200 @66fps
               * 320x256 @50fps
c2p060onlyCLS: On 68040/25MHz PAL
68060 only       320x200 @44.5fps
                 320x256 @33.8fps
c2pGeneric   : On 68030/50MHz PAL
all, but       * 320x200 @40.1fps
mainly 68030   * 320x256 @30.7fps
               On 68040/25MHz DoublePAL
                 320x200 @42fps
                 320x256 @31fps
               On 68040/25MHz PAL
                 320x200 @44fps
                 320x256 @34fps
c2pGenericCLS: On 68040/25MHz PAL
all, but         320x200 @38.8fps
mainly 68030     320x256 @29.7fps
c2p040plus   : On 68030/50MHz PAL
68040-68060      320x200 @24.3fps
                 320x256 @18.5fps
               On 68040/25MHz DoublePAL
               * 320x200 @46fps
               * 320x256 @34.2fps
               On 68040/25MHz PAL
               * 320x200 @49.2fps
               * 320x256 @37.9fps
               On 68060/50MHz PAL
               * 320x200 @66fps
               * 320x256 @50fps
c2p040plusCLS: On 68040/25MHz PAL
68040-68060    * 320x200 @45.6fps
               * 320x256 @35fps
c2pCACHE     : On 68030/50MHz PAL
68040-68060      320x200 @23.5fps
                 320x256 @18.0fps
               On 68040/25MHz DoublePAL
               * 320x200 @47.1fps
               * 320x256 @35.3fps
               On 68040/25MHz PAL
               * 320x200 @50fps
               * 320x256 @38.3fps
               On 68060/50MHz PAL
               * 320x200 @66.1fps
               * 320x256 @49.6fps
For 68030 owners, do not use c2p040plus, c2p040only, c2p060only, or c2pCACHE.
These will give very bad performance on that cpu.
All of the routines except for c2pCACHE allow you to specify the size of the
chunky-to-planar operation by way of a c2pRoutineInit{} statement, where
`Routine' is the name of the routine (e.g. c2p040onlyInit{}). If you alter the
size of the c2p operation you should generally also alter the size of your planar
destination bitmap to be equal.
It is, however, possible to have a taller planar bitmap than the height of the
chunky-to-planar operation. #c2pBPLSIZE has to be altered to reflect this. The
planar height must always be equal to or greater than the chunky height.
Each c2p routine has two inputs. The first parameter is the address of the chunky
buffer and the second parameter is the address of the planar buffer. Planar
memory must be contiguous, so I suggest initialising a bank or reserving some
memory, and then using CludgeBitmap. The inputs to the init statements are the
width and height of the chunky buffer, and hence the size of the c2p operation.
The init routine only needs to be called once in a program for any number of c2p
calls.
c2pCACHE is different in that you must specify operation size in constants which
cannot easily be altered during the running of the program, so you are restricted
to one size of operation per program run.
All you have to do to set up a c2p operation is something along these lines (for
example):
    InitBank 2,320*256,$10000        ; Fastram chunky buffer
    InitBank 0,320*256,$10002        ; Chipram planar buffer
    CludgeBitmap 0,320,256,8,Bank(0)
    c2pGenericInit{320,256}
    c2pGeneric{Bank(2),Bank(0)}
Of course, replace the c2pGeneric statements with the ones for the relevant c2p
that you are using.
The only exception to this is c2pCACHE. This requires that you cludge bitmaps to
8 bytes past the start of the planar buffer, and that you tell the c2pCACHE
routine to output to an address 4 bytes past the start of the planar buffer. So
you have to allow for this by reserving a little extra memory, like this:
    InitBank 2,320*256,$10000        ; Fastram chunky buffer
    InitBank 0,(320*256)+8,$10002    ; Chipram planar buffer
    CludgeBitmap 0,320,256,8,Bank(0)+8
    c2pCACHE{Bank(2),Bank(0)+4}
As well as c2pCACHE having to be set up with constants, there is also no
clearscreen version because it is not possible to implement it due to the nature
of the way the c2p works.
Generally you should ensure that the base address of a planar bitmap's bitplane
data is aligned to a 64-pixel boundary (64 pixels of one bitplane is 64 bits, or
8 bytes). Reserving some memory with AllocMem or InitBank usually seems to do
this very reliably. c2pCACHE requires that you create bitmaps at 8 bytes past the
start of the data, and that you begin the c2p operation at 4 bytes past the
start. This is to ensure that the data being displayed is 64-bit aligned;
otherwise you would get a lower datafetch.
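The address arithmetic described above can be modelled in a few lines. This is only a sketch in Python (not Blitz Basic), and the function names and the example bank address are hypothetical; it assumes the bank itself starts on an 8-byte boundary, as the text says AllocMem or InitBank usually provide.

```python
# Model of the address arithmetic only; real code would use Bank() addresses.
ALIGN_BYTES = 8  # 64 bits, i.e. 64 pixels of one bitplane


def is_64bit_aligned(addr):
    """True if addr sits on an 8-byte (64-bit) boundary."""
    return addr % ALIGN_BYTES == 0


def c2pcache_layout(bank_start):
    """Return (bitmap_addr, c2p_output_addr) for a c2pCACHE planar buffer.

    Mirrors the rule in the text: cludge the bitmap 8 bytes past the start
    of the buffer and point c2pCACHE 4 bytes past the start.
    """
    return bank_start + 8, bank_start + 4


bank0 = 0x00020000  # hypothetical chip-RAM address of the planar bank
bitmap_addr, c2p_out = c2pcache_layout(bank0)
print(hex(bitmap_addr), hex(c2p_out), is_64bit_aligned(bitmap_addr))
```

The point of the check is that the displayed bitmap stays 64-bit aligned even though the routine's output address is not.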
In amigamode, if the first longword of data that is being displayed is from a
64-bit aligned address, the o/s will use 64-bit datafetch, which means faster
chunky-to-planar conversion. If you begin to scroll the display with hardware
scrolling and you go beyond 32 pixels, the first longword being displayed will no
longer be 64-bit aligned, and so the o/s will automatically switch to fetchmode 1
or 2 (32-bit datafetch), which will slow down the c2p. Worse still is fetchmode
0. The o/s will not normally use it, but if YOU set the datafetch yourself, make
sure you do NOT use datafetch 0, because that will at least double the time it
takes to do the c2p operation, and that is bad news.
To do scrolling with chunky screens it is not normally the best idea to use
hardware scrolling. The c2p's do not have a line modulo so you would have to make
your planar bitmap 64 pixels wider which means a further 64x200 or 64x256 area to
be converted. This is also a waste because one longword in chunky is only 4
pixels, so a hardware scroll of 0..3 is normally all that is required. So the
remaining 60 pixels are a total waste. As such, I recommend using software
scrolling and generally speaking, if you have enough power to use
chunky-to-planar well then you should also be thinking of refreshing the whole
screen every frame rather than any of the traditional scroll methods. Taking a
leap to using chunky is also to take a leap towards other factors which come as
part of the package. Screens are normally fully refreshed each frame, scrolling
is done in software, blits are done with the cpu and generally there is cpu
horsepower to back this all up. 030/50's are generally going to be a little
limited in what can be achieved with a decent screensize. I suggest 040/25 is the
entry-level for chunky-to-planar equipped software, unless you have direct output
to a graphics card which does not therefore require any data-conversion.
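To make the "software scrolling with a full refresh" idea concrete, here is a minimal sketch in Python (not Blitz Basic). It assumes a larger chunky map held one byte per pixel, and every name in it is hypothetical. Each frame, the visible window is simply copied out at the current scroll position and handed to the c2p, so the planar bitmap never needs the extra 64 pixels of width (which on a 320-wide screen would be 20% more data to convert).

```python
def render_view(world, world_width, view_x, view_y, view_w, view_h):
    """Copy a view_w x view_h window out of a larger chunky map.

    world is a bytes-like object holding one byte per pixel, row by row.
    The result is a fresh chunky buffer ready to be handed to a c2p.
    """
    view = bytearray(view_w * view_h)
    for row in range(view_h):
        src = (view_y + row) * world_width + view_x
        view[row * view_w:(row + 1) * view_w] = world[src:src + view_w]
    return view


# A tiny 8x4 "world" whose pixel value equals its x coordinate.
world = bytes(x for _ in range(4) for x in range(8))
print(list(render_view(world, 8, 3, 1, 4, 2)))
```

Scrolling is then just a matter of changing view_x and view_y between frames.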
For a purely generic setup, use the c2pGeneric routine. It will, however, be
quite crippling to 040 or 060 processors but will better support the low end. To
take things one stage further, use also c2p040plus which is a generic routine for
040-060 cpu's. To take it to the next level you should be looking to have a
specific routine for each cpu. For 030's use c2p030only and for 040's use
c2p040only. It seems that c2p040only is actually faster than c2p060only when
running on a 68060/50, but there is hardly anything in it so take your pick.
c2pCACHE is another replacement possibility for c2p040plus and is faster but less
flexible. Certainly you don't need to support ALL of the c2p's in your software
as there is quite a lot of overlap and it may come down to personal taste.
Personally I would use c2p030only for anything below 040, and c2p040only for
anything from 040 upwards. If I had to choose ONE generic c2p I would go for
c2p040plus as it performs slightly better than c2pCACHE when used on 030's,
although either of them on 030 is pretty poor, so I would generally target the
software at 040 upwards.
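The selection advice above boils down to a small lookup. The following Python sketch is only an illustration of that decision; the function name and the three strategy labels are made up for this example, and how the cpu model is detected is left out entirely.

```python
def choose_c2p(cpu, strategy="dedicated"):
    """Pick a routine name for a cpu model number such as 68030 or 68060.

    strategy (hypothetical labels):
      "generic"   - one routine for everything
      "two_tier"  - c2pGeneric for the low end, c2p040plus for 040 and up
      "dedicated" - the per-cpu choice: 030only below 040, 040only from 040 up
    """
    if strategy == "generic":
        return "c2pGeneric"
    if strategy == "two_tier":
        return "c2pGeneric" if cpu < 68040 else "c2p040plus"
    if strategy == "dedicated":
        return "c2p030only" if cpu < 68040 else "c2p040only"
    raise ValueError("unknown strategy: " + strategy)


for cpu in (68030, 68040, 68060):
    print(cpu, choose_c2p(cpu), choose_c2p(cpu, "two_tier"))
```

c2pCACHE is left out of the sketch because it needs its size fixed in constants and a different buffer layout, so it cannot be swapped in by name alone.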
Generally speaking, the clearscreen routines save you between 3 and 5 frames per
second compared with having to do a separate clearscreen routine. Time is mainly
gained by minor pipelining and the fact that the c2p routine is already handling
and setting up the loop. All that has been done to facilitate the clearscreen is
that (in most cases) a7 is loaded with #clearscreento, which is a longword, and
then move.l (a0)+,Dn is converted to move.l (a0),Dn : move.l a7,(a0)+ ; or if it
is the 030only or generic routine, then it has been converted to move.l (a0),Dn :
move.l #clearscreento,(a0)+, because those routines do not have a7 spare.
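The read-then-overwrite pattern can be modelled as follows. This is a Python sketch of the memory access pattern only, not of the c2p itself; the function name is hypothetical and the buffer length is assumed to be a multiple of 4 bytes.

```python
import struct


def read_and_clear(chunky, fill_long):
    """Return the longwords read from chunky, overwriting each with fill_long.

    chunky must be a bytearray whose length is a multiple of 4.
    """
    fill = struct.pack(">I", fill_long)  # big-endian, as on the 68k
    values = []
    for offset in range(0, len(chunky), 4):
        # move.l (a0),Dn : read without advancing the pointer
        values.append(struct.unpack_from(">I", chunky, offset)[0])
        # move.l a7,(a0)+ : store the clear value, then advance
        chunky[offset:offset + 4] = fill
    return values


buf = bytearray(b"\x01\x02\x03\x04\x05\x06\x07\x08")
longs = read_and_clear(buf, 0)
print([hex(v) for v in longs], bytes(buf))
```

Because each longword is cleared while the loop is already visiting it, no second pass over the buffer is needed, which is where the saving described above comes from.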
It is possible to do a screencopy at the same time as the c2p, but this is not
feasible on c2p030only or c2pGeneric as there needs to be a spare register.
Therefore, a screencopy in place of the clearscreen (which will also do the same
thing as a clearscreen, effectively) is only viable on 040 upwards, and judging
by the time it takes it may be better suited to 060 only. It is therefore
suggested that it might be faster to do a separate screencopy which is perhaps
hardcoded and may use move16, which may equal or surpass the time that might be
saved by doing the screencopy at the same time as the c2p. I HAVE done a
screencopy test using c2p040only, in which move.l (a0)+,Dn has been converted to
move.l (a0),Dn : move.l (a7)+,(a0)+ ; and it seems to perform @44.3fps for
320x200, or @34.1fps for 320x256 (040/25 results). The copy costs about a further
3 frames per second on top of the c2pCLS time, or about 9-10fps compared with a
c2p that does not do anything additional (c2p040only). If you can do a screencopy
separately, perhaps using movem or move16, faster than this on 040/25, then I
suggest you do that rather than modify the c2p. Judging by the time it takes and
the number of chunky blits it gobbles up I would suggest that fullscreen copy is
not very viable on anything lower than 040 and is probably questionable on
040/25.
If you have a horizontal strip at the top or bottom or even middle of your
display that does not need to be clearscreened and yet is updated a lot, use a
clearscreening c2p for the main game area and a non-clearscreening one for the
panel area.
When it comes to chunky blitting you need to take into account the processor you
are working on. If you have anything from 68040 upwards it is faster to have mask
data (same size as the graphic) and to write longwords to non-aligned addresses,
than to try generating the mask on-the-fly. The code: move.l (a2)+,d0 : and.l
d0,(a1) : move.l (a0)+,d1 : or.l d1,(a1)+ ; will do one longword of masked blit
to anywhere on the screen, about 2-3 times faster than if you try to generate the
mask from the source data. Also, writing longwords to odd addresses is not
supported on the 68000 (it causes an address error), although the 68020 and
higher do allow it. If you do it with a copyback
cache it is very quick, so that masked blits are only about 30% slower than
unmasked ones. If you are not going to write to non-aligned addresses you have to
do shifting or rotating in the cpu, which if using mask data means the mask as
well. This takes further time. But these memory-intensive methods may not prove
to be quite so efficient on 030's as they do not have a copyback cache.
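The prerendered-mask blit is an AND followed by an OR on each destination longword. The Python sketch below models it one byte (one pixel) at a time rather than four at a time, with hypothetical function names. The mask polarity is an assumption inferred from the AND/OR order: $FF where the object is transparent, so the background survives, and $00 where it is solid.

```python
def make_mask(graphic):
    """One mask byte per pixel: 0xFF keeps the background, 0x00 clears it."""
    return bytes(0xFF if pixel == 0 else 0x00 for pixel in graphic)


def masked_blit(screen, dest_offset, graphic, mask):
    """AND the mask into the destination, then OR in the graphic."""
    for i in range(len(graphic)):
        kept = screen[dest_offset + i] & mask[i]   # punch a hole for solid pixels
        screen[dest_offset + i] = kept | graphic[i]  # drop the object into it


screen = bytearray([9] * 8)          # background colour 9
graphic = bytes([0, 5, 0, 7])        # colour 0 is transparent
masked_blit(screen, 2, graphic, make_mask(graphic))
print(list(screen))
```

Building the mask once, up front, is what lets the inner loop avoid testing every source pixel for zero.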
I did not write any of the c2p's myself, only the minor modifications and the
example program. You are free to use them all in any of your productions,
freeware, shareware, and even commercialware. I hope you are thankful to those
talented few that have mastered their craft in making these c2p's and for
releasing them for public use.
Please find also enclosed in this archive a demonstration program. There is an
040 version for 040-to-060, and an 030 version. This program will use a
clearscreening c2p and will bounce a number of chunky cpu-blitted objects around
the screen. The blit routines do not do any clipping and the loop for movement
and rendering of the objects is currently hardcoded into a single statement. This
is quite a bit faster than calling a statement for every object.
There are some constants which you can alter. The demonstration program has the
facility to have a planar bitmap height larger than the chunky bitmap height. You
must not allow it to be smaller, however! #planarheight should be >= #c2pBPLY.
If you use: #c2pBPLY=200 : #planarheight=256 ; the routine will do a c2p
operation on the first 200 lines and leave the bottom 56 lines as they are. This
shows how you can use the vertical modulo should you need to.
There are other constants to alter. #iterations is how many loops will be done
before the program exits. #objcount is the number of cpu-blit objects that will
be moved and drawn. Refer to the example 040/25 results for guidelines as to what
to set this at. #objwidth and #objheight are the size of the objects. They must
not be larger than the size of the chunky buffer and preferably should be at
least about 16 pixels smaller in both dimensions for the movement routine to work
properly. #objwidth must be a multiple of 4 and must not be smaller than 4.
#objheight does not need to be a multiple of anything but must not be smaller
than 1. The routine currently has constants which will render 85 32x32 256-colour
masked objects with a screen size of 320x240. Use PAL in preference to
DoublePAL if you want a higher framerate. 320x200 will yield even higher results.
If you alter the chunky height don't forget about the planarheight.
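The constraints on these constants can be gathered into one check. This is a Python sketch with a hypothetical function name; chunky_h stands for #c2pBPLY and planar_h for #planarheight, while chunky_w is an assumed name for the chunky width.

```python
def check_demo_constants(chunky_w, chunky_h, planar_h, obj_w, obj_h):
    """Return a list of rule violations (empty if the constants look fine)."""
    problems = []
    if planar_h < chunky_h:
        problems.append("#planarheight must be >= #c2pBPLY")
    if obj_w < 4 or obj_w % 4 != 0:
        problems.append("#objwidth must be a multiple of 4 and at least 4")
    if obj_h < 1:
        problems.append("#objheight must be at least 1")
    if obj_w > chunky_w or obj_h > chunky_h:
        problems.append("object is larger than the chunky buffer")
    elif obj_w > chunky_w - 16 or obj_h > chunky_h - 16:
        problems.append("object should be about 16 pixels smaller than the buffer")
    return problems


print(check_demo_constants(320, 240, 240, 32, 32))   # the shipped defaults
print(check_demo_constants(320, 256, 200, 30, 32))   # two deliberate mistakes
```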
#objmasking should be set to either 0 or nonzero, which means you can use
anything other than 0 (1, -1, 20, -50, etc). If zero, there will be no masking
performed and you will attain higher output, but all objects will be solid. If
objmasking is nonzero, there will be masking and any zero pixels will be
transparent. The routine will default to using masking. The mask routine
uses a prerendered mask image, similar to planar masks, except that it is a byte
for every pixel. This is unavoidable if there is to be such speedy processing. It
is perfectly feasible to generate a mask that differs from the graphic data so
that any number of colours will be transparent. Don't forget that if you specify
an area as solid when it is blank in the graphic, the blank pixel will be drawn.
There is very little difference between the masked and unmasked routines. You
will notice that the masked routine could be simplified as and.l (a2)+,(a1) :
or.l (a0)+,(a1)+ ; but this is illegal in 68000 so I have had to expand it a
little. Currently both routines will allow total flexibility in terms of width
and height (width to nearest 4 pixels), and so use an x loop nested inside a y
loop. If you expand the x loop for a hardcoded version you will get more output,
and similarly with the y loop, although hardcoded large objects take up too much
space to work in cache. Whereas it is possible to do 900 8x8 masked objects on
040/25, it is possible to do 1100 if the routine is hardcoded for 8x8 with both
loops fully unfolded (ie no loops). The larger the objects you use the less
intermediary time is used in setting them up. Lots of small objects take a lot of
processing of the movement table. Typically, there is time to do about two and a
half 320x200 screenfuls of blitting in the time left after the clearscreening
c2p, on 040/25. The objects that the demo routine uses are generated when you run
the program, so they are basically just random pixels and the palettes are fairly
random too.
I have also added a second demo program, which is the same as the first except
that it is dynamic in the number of objects that it displays. You set it a target
frames-per-second rate that you want to know results for. You tell it how many
objects to start off with (must be greater than 0) and how many objects to add
each time (greater than 0). Iterations in the second demo represent how many
loops to do before adding more objects, and this should not be too low or the
routine won't work out whether it's reached the target framerate properly or not.
You have to set a maximum number of objects, because the table has to be
initialised for the eventuality of that many objects becoming displayed. I set it
to 3000 initially which is more than enough for most usual object sizes on all
cpu's.
There are versions of the demo for 030 and for 040(+) as before. You set the
program running and it will progressively add objects and move them and will
keep doing so until it reaches the target framerate. Then it will tell you
precisely what the framerate was at the end and how many objects it managed to
display at that rate. So instead of having to do tests on some constant number of
objects, you can let the routine chug away adding more and more until you are
maxed out for the selected framerate. The demo will default to doing 16x16
objects, starting with 10, and adding 10 more every 40 loops. You can set the
starting number of objects (objcount) to a value much closer to what you expect
will be the end result, in order to hasten the report. The starting number of
objects should never exceed the maximum number, however, or it will probably
start drawing objects everywhere in memory (65536x65536!).
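The search that the second demo performs can be outlined like this. It is a Python sketch of the control flow only, with hypothetical names throughout: measure_fps(count) stands in for running the demo loop for #iterations frames with that many objects and timing it, and the timing model at the bottom is made up purely so the sketch can run.

```python
def find_object_limit(measure_fps, target_fps, start, step, max_objects):
    """Add objects in steps until the measured rate falls to the target.

    Returns (count, fps) for the first count at or below target_fps, or for
    max_objects if the target is never reached.
    """
    if not (0 < start <= max_objects) or step <= 0:
        raise ValueError("need 0 < start <= max_objects and step > 0")
    count = start
    fps = measure_fps(count)
    while fps > target_fps and count < max_objects:
        count = min(count + step, max_objects)  # never run past the table
        fps = measure_fps(count)
    return count, fps


# Made-up timing model: the frame rate falls as the object count rises.
def fake_fps(count):
    return 5000.0 / (100 + count)


print(find_object_limit(fake_fps, 25.0, 10, 10, 3000))
```

Clamping the count to max_objects is the sketch's way of honouring the warning above about never exceeding the size of the preinitialised table.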
Unless you are particularly fussy you do NOT need to have a planar doublebuffer
when using the c2p's, so long as they are running quite fast, ie that the overall
routine is not slowing below 25fps, or not much anyway. There will be a slight
flicker on one or two lines of the display, perhaps, but it will not be the
full-screen type of flickering that you can get on planar. Yes, the c2p's are
outputting to planar but the way they do it seems to minimise flicker. I
personally do not use a doublebuffer and I hardly notice that I haven't. When
you're in the middle of all the action you won't notice either so long as things
don't slow down too much. Even if the overall routine slows down the c2p will
still take the same amount of time so it should be okay. So you can probably cut
out the time it takes to do screen swapping or other doublebuffer methods. Of
course, if you have a graphics card, it is fairly normal to refresh straight into
the display and people have reported that there is little or no flicker
whatsoever.
I would be interested to know how any of these routines perform on your
specification of Amiga, and particularly how well the clearscreening c2p's do on
68060's, ie, does it clearscreen `for free'. Problems or ideas, give me a yell.
If you get any problems implementing or using or adapting or modifying the
routines, email at paul@stationone.demon.co.uk
Enjoy.